In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that made this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Lily Moreno, Director of Marketing, believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
This project will answer the following key question:
How do casual riders and annual members use Cyclistic bikes differently?
In order to answer the key question, the following questions will be analysed:
The key stakeholders for this project are:
The key stakeholders are non-technical business leaders. This report will present findings in plain English and will be accompanied by an executive summary slide deck.
In this section, we describe all data sources used and how the data is organised. We check for issues with bias or credibility, verify the data’s integrity, and confirm that the data can answer our questions.
In order to answer the business question, this report uses Cyclistic’s historical trip data.1 This dataset contains the following information:
For this analysis we downloaded data for the last 12 months from the source above. This analysis covers the period from 1 January 2024 to 31 December 2024.
We read the data for the last 12 months by running the 1_read_data.R script located in the working directory.
source(here::here("R/1_read_data.R"))
# View the head of the data
head(df)
## # A tibble: 6 × 15
## filename ride_id rideable_type started_at ended_at
## <chr> <chr> <chr> <dttm> <dttm>
## 1 /Users/claregib… C1D650… electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 /Users/claregib… EECD38… electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 /Users/claregib… F4A9CE… electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 /Users/claregib… 0A0D9E… classic_bike 2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 /Users/claregib… 33FFC9… classic_bike 2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 /Users/claregib… C96080… classic_bike 2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 10 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # short_filename <chr>
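The full read script is not reproduced here, but its core task can be sketched in base R: read each monthly CSV extract, record which file each row came from, and bind the results into one data frame. The file names and tiny demo files below are hypothetical, used only to make the sketch self-contained.

```r
# Hypothetical sketch of a monthly-file read step, demonstrated on two
# tiny CSVs written to a temp directory (the real data lives elsewhere).
dir <- tempdir()
write.csv(data.frame(ride_id = "A1"), file.path(dir, "202401-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B2"), file.path(dir, "202402-tripdata.csv"), row.names = FALSE)

read_one <- function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d$filename <- f  # record the source file for each row
  d
}

files <- list.files(dir, pattern = "tripdata\\.csv$", full.names = TRUE)
df_demo <- do.call(rbind, lapply(files, read_one))
nrow(df_demo)  # one row per monthly file in this demo
```

Recording the source filename is what makes the per-file missing-data counts later in this report possible.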
We run code to produce a profile of each column within the data set.
skimr::skim_without_charts(df)
| Data summary | |
|---|---|
| Name | df |
| Number of rows | 5860568 |
| Number of columns | 15 |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| POSIXct | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| filename | 0 | 1.00 | 128 | 128 | 0 | 12 | 0 |
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5860357 | 0 |
| rideable_type | 0 | 1.00 | 12 | 16 | 0 | 3 | 0 |
| start_station_name | 1073951 | 0.82 | 10 | 64 | 0 | 1808 | 0 |
| start_station_id | 1073951 | 0.82 | 3 | 35 | 0 | 1763 | 0 |
| end_station_name | 1104653 | 0.81 | 10 | 64 | 0 | 1815 | 0 |
| end_station_id | 1104653 | 0.81 | 3 | 35 | 0 | 1768 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| short_filename | 0 | 1.00 | 6 | 6 | 0 | 12 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.91 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 7232 | 1 | 41.90 | 0.06 | 16.06 | 41.88 | 41.90 | 41.93 | 87.96 |
| end_lng | 7232 | 1 | -87.65 | 0.11 | -144.05 | -87.66 | -87.64 | -87.63 | 152.53 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2024-01-01 00:00:39 | 2024-12-31 23:56:49 | 2024-07-22 20:36:16 | 5649600 |
| ended_at | 0 | 1 | 2024-01-01 00:04:20 | 2024-12-31 23:59:55 | 2024-07-22 20:53:59 | 5652165 |
From the data profile, we can see that more than 1 million records are missing start or end station data. We need to better understand where these missing records occur. First, let’s count the number of missing records by filename.
df |>
filter(is.na(start_station_id) | is.na(end_station_id)) |>
count(short_filename)
## # A tibble: 12 × 2
## short_filename n
## <chr> <int>
## 1 202401 31065
## 2 202402 38428
## 3 202403 71409
## 4 202404 117227
## 5 202405 167325
## 6 202406 216395
## 7 202407 208031
## 8 202408 214424
## 9 202409 284042
## 10 202410 167167
## 11 202411 89104
## 12 202412 47642
There is missing data within every file. Let’s check whether the missing data is restricted to a specific bike type.
df |>
filter(is.na(start_station_id) | is.na(end_station_id)) |>
count(rideable_type)
## # A tibble: 3 × 2
## rideable_type n
## <chr> <int>
## 1 classic_bike 7489
## 2 electric_bike 1548260
## 3 electric_scooter 96510
Here we can see that most of the missing station data comes from electric bikes. We can check whether the latitude and longitude on the missing records can be matched to records that do have station data. If so, we could infer the missing stations from those matches.
# How many stations are there for every distinct lat/long?
df |>
group_by(start_lat, start_lng) |>
mutate(stations = n_distinct(start_station_id)) |>
filter(stations > 1) |>
select(start_lat, start_lng, stations, start_station_id, start_station_name) |>
distinct() |>
arrange(start_station_id) |>
head(10)
## # A tibble: 10 × 5
## # Groups: start_lat, start_lng [8]
## start_lat start_lng stations start_station_id start_station_name
## <dbl> <dbl> <int> <chr> <chr>
## 1 41.8 -87.7 2 1042 Public Rack - Laflin St &51st …
## 2 42.0 -87.8 2 1277.0 Public Rack - Peterson Park
## 3 42.0 -87.8 2 1284.0 Public Rack - Forest Glen Stat…
## 4 41.9 -87.6 2 13001 Michigan Ave & Washington St
## 5 41.9 -87.6 3 13001 Michigan Ave & Washington St
## 6 41.9 -87.6 2 13001 Michigan Ave & Washington St
## 7 41.9 -87.7 5 13001 Michigan Ave & Washington St
## 8 41.9 -87.6 3 13006 LaSalle St & Washington St
## 9 41.9 -87.7 2 13006 LaSalle St & Washington St
## 10 41.9 -87.7 2 13008 Millennium Park
# How many lat/longs are there for every distinct station?
df |>
group_by(start_station_id) |>
mutate(lat_lngs = n_distinct(start_lat, start_lng)) |>
filter(lat_lngs > 1) |>
select(start_lat, start_lng, lat_lngs, start_station_id, start_station_name) |>
distinct() |>
arrange(start_lat) |>
head(10)
## # A tibble: 10 × 5
## # Groups: start_station_id [2]
## start_lat start_lng lat_lngs start_station_id start_station_name
## <dbl> <dbl> <int> <chr> <chr>
## 1 41.6 -87.5 725 <NA> <NA>
## 2 41.6 -87.6 725 <NA> <NA>
## 3 41.6 -87.6 725 <NA> <NA>
## 4 41.6 -87.5 24 20215 Hegewisch Metra Station
## 5 41.6 -87.5 24 20215 Hegewisch Metra Station
## 6 41.6 -87.5 24 20215 Hegewisch Metra Station
## 7 41.6 -87.5 24 20215 Hegewisch Metra Station
## 8 41.6 -87.5 24 20215 Hegewisch Metra Station
## 9 41.6 -87.5 24 20215 Hegewisch Metra Station
## 10 41.6 -87.5 24 20215 Hegewisch Metra Station
The code above shows a many-to-many relationship between station ID and lat/long: each station ID can map to many lat/long values, and each lat/long value can map to many station IDs. This makes it impractical to infer the missing station data, so we will exclude records with missing station data from any analysis of questions related to rental stations.
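The exclusion rule itself is simple: keep only rows where both stations are present. A minimal sketch, shown on a small hypothetical data frame rather than the full data set:

```r
# Sketch of the station-analysis exclusion rule on hypothetical rows:
# drop any trip missing either its start or end station ID.
trips <- data.frame(
  ride_id          = c("A", "B", "C"),
  start_station_id = c("13001", NA, "13006"),
  end_station_id   = c("13006", "13008", NA),
  stringsAsFactors = FALSE
)
station_trips <- trips[!is.na(trips$start_station_id) &
                       !is.na(trips$end_station_id), ]
nrow(station_trips)  # only ride "A" has complete station data
```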
We can also review the station_id and station_name columns to check that IDs and names are consistent. First, we set up a dataframe containing all of the distinct combinations of station_id and station_name.
start_stations <-
df |>
select(
station_id = start_station_id,
station_name = start_station_name
) |>
distinct()
end_stations <-
df |>
select(
station_id = end_station_id,
station_name = end_station_name
) |>
distinct()
stations <-
start_stations |>
bind_rows(end_stations) |>
distinct() |>
arrange(station_id)
We can use this dataframe to check that each station_id is associated with a single station_name.
stations |>
count(station_id) |>
arrange(desc(n)) |>
filter(n > 1)
## # A tibble: 97 × 2
## station_id n
## <chr> <int>
## 1 Hubbard Bike-checking (LBS-WH-TEST) 3
## 2 1524189 2
## 3 20129 2
## 4 21322 2
## 5 21366 2
## 6 21371 2
## 7 21379 2
## 8 21393 2
## 9 23114 2
## 10 23187 2
## # ℹ 87 more rows
There are 97 station_ids that are associated with multiple station_names. We’ll need to account for this in the data cleaning.
Next, we check whether each station_name is associated with a single station_id.
stations |>
count(station_name) |>
arrange(desc(n)) |>
filter(n > 1)
## # A tibble: 49 × 2
## station_name n
## <chr> <int>
## 1 Albany Ave & Douglas Blvd 2
## 2 Artesian Ave & 55th St 2
## 3 Burling St & Diversey Pkwy 2
## 4 California Ave & 36th St 2
## 5 Campbell Ave & 51st St 2
## 6 Central Park Ave & Douglas Blvd 2
## 7 Cicero Ave & Wellington Ave 2
## 8 Fairfield Ave & 44th St 2
## 9 Francisco Ave & 47th St 2
## 10 Harding Ave & 26th St 2
## # ℹ 39 more rows
There are 49 station names that are associated with multiple station IDs.
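One way such inconsistencies could be reconciled during cleaning (a hypothetical approach, not necessarily the one applied in this report’s scripts) is to map each station_id to its most frequently observed station_name. A base R sketch on made-up observations:

```r
# Hypothetical example: resolve each station_id to a single canonical
# name by taking the most frequent name observed for that ID.
obs <- data.frame(
  station_id   = c("13001", "13001", "13001", "20215"),
  station_name = c("Michigan Ave & Washington St",
                   "Michigan Ave & Washington St",
                   "Michigan Ave & Washington",   # inconsistent variant
                   "Hegewisch Metra Station"),
  stringsAsFactors = FALSE
)
canonical <- aggregate(
  station_name ~ station_id,
  data = obs,
  FUN  = function(x) names(which.max(table(x)))  # modal name per ID
)
canonical$station_name[canonical$station_id == "13001"]
# "Michigan Ave & Washington St"
```

Frequency-based resolution assumes the common spelling is the correct one, which is reasonable for typos but would need review for genuine renames.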
We can also check the validity of the trip durations by calculating the number of minutes between the start and end times of each trip, and then reviewing the summary statistics for these values.
df |>
mutate(trip_duration = as.numeric(difftime(ended_at, started_at, units = "mins"))) |>
pull(trip_duration) |>
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2748.317 5.545 9.717 17.316 17.250 1559.933
Here we can see that half of the values lie between roughly 5.5 and 17.3 minutes (the interquartile range), but there are extreme outliers. The minimum value is -2,748 minutes, which is impossible: it means the end time is before the start time. We should plan to exclude any negative values of trip duration in our analysis.
df |>
mutate(trip_duration = as.numeric(difftime(ended_at, started_at, units = "mins"))) |>
filter(trip_duration < 0) |>
count()
## # A tibble: 1 × 1
## n
## <int>
## 1 227
There are 227 records with a negative trip duration.
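The planned exclusion can be sketched in base R on a small hypothetical data frame: compute the duration with difftime(), then keep only non-negative values.

```r
# Sketch of the negative-duration exclusion on hypothetical trips.
trips <- data.frame(
  ride_id    = c("A", "B", "C"),
  started_at = as.POSIXct(c("2024-01-01 10:00:00",
                            "2024-01-01 11:00:00",
                            "2024-01-01 12:00:00")),
  ended_at   = as.POSIXct(c("2024-01-01 10:30:00",
                            "2024-01-01 10:55:00",  # ends before it starts
                            "2024-01-01 12:10:00"))
)
trips$trip_duration <- as.numeric(
  difftime(trips$ended_at, trips$started_at, units = "mins")
)
clean <- trips[trips$trip_duration >= 0, ]
nrow(clean)  # ride "B" (negative duration) is dropped
```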
The table below summarises our analysis of the integrity of this data set.
| Principle | Passing threshold | Pass/Fail | Comments |
|---|---|---|---|
| Reliable | Accurate, complete, unbiased | Fail | Data has missing and inconsistent values for station IDs and locations. Proceed with caution when using these fields |
| Original | Data comes from original source | Pass | Data is from original source |
| Comprehensive | Contains all critical information needed to answer the question | Pass | Data is provided for all trips, with the exception of missing data as noted above |
| Current | Current and relevant to the task at hand | Pass | Data is provided up to and including last month |
| Cited | Data comes from a known and credible source | Pass | Data is provided directly by Lyft, the company that operates the bike-share system |
Based on our analysis, there are some reliability issues that must be addressed. Where a question relies on fields with missing values, the observations with missing data will first be excluded from the data set.
In this section, we document the steps taken to clean the data for analysis. After cleaning, the data will:
The data processing steps for this analysis are performed in the R/2_process_data.R script.
Data is provided under licence from Lyft Bikes and Scooters, LLC↩︎